Session 1
Introduction to R and RStudio
Basic data types
Functions
Arithmetic and logical operators
Vectors, data frames, and factors
Indexing
Reading data
Plotting with base R
RStudio has a four panel layout: documents, console,
environment, and stuff (plots, help, files, …)
Set the working directory
(Session > Set Working Directory) to the
working_area folder
Open a new document … File > New File >:
R Notebook: allows you to interleave R
code, comments, and images. An HTML file is automatically produced that
records the last outputs when the document is saved
R Markdown: the same, but an HTML file is only
produced when asked, and all code is re-run (knit) before it’s
produced
Commands run in the documents pane and in the console will behave
identically. Run a line of code with ⌘+Enter (Mac) or
CTRL+Enter (Windows), or by typing it into the console
directly.
Run the whole of the current code block with
⌘+Shift+Enter (Mac) or CTRL+Shift+Enter
(Windows)
Insert a code block at the position of the cursor with
⌘+⌥+I (Mac) or CTRL+Alt+I (Windows)
Press the ⏵ (run) button to run a code
block
Press ⌘+S (Mac) or CTRL+S (Windows) to
save
Let’s try some basic calculations. R knows all the usual
arithmetic operators: +, -, /,
*, and ^. By default, results are printed to
the screen.
## [1] 2
To store an answer, we must assign it with either the
<- or = operator.
Here, an object (or variable), a, is
created to store the calculation’s result. We can query what
a stores.
## [1] 5
Assignment works from right to left. Here, we can say that
b becomes equal to the result of
sqrt(25).
## [1] 5
Stored objects appear in RStudio’s environment panel. As
in algebra, they can be used in calculations.
## [1] 25
Assigning to an existing object will overwrite its previous value.
## [1] 25
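The commands behind the outputs above are not shown. A minimal sketch that would print the same values, assuming the first calculation was `1 + 1` and that `a` stored `2 + 3` (both are guesses), might be:

```r
1 + 1           # results are printed to the screen: [1] 2

a <- 2 + 3      # store the result in an object called a
a               # query what a stores: [1] 5

b <- sqrt(25)   # b becomes equal to the result of sqrt(25)
b               # [1] 5

a * b           # stored objects can be used in calculations: [1] 25

a <- a * b      # assigning to an existing object overwrites it
a               # [1] 25
```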
As in all programming languages, in R, there are rules
about what names an object can be given.
Object names cannot start with a number …
… contain special characters (e.g. a minus sign) …
… or contain spaces.
Object names can only contain letters, numbers, underscores, and points.
Names should be short and meaningful so that you can type them easily and know what they store.
tcga_data = ... # a nice, short, meaningful name
the_data_we_got_from_that_2019_paper = ... # not a good name!
t = ... # tulip? turtle? time?
As spaces aren’t allowed, there are two preferred options instead.
sqrt() was an example of a function. A function
runs a pre-determined calculation or procedure, generally (but not
always) returning an answer.
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1
## [1] 10
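The three outputs above are consistent with the following sketch, which assumes the vector was stored as `one_to_ten` (the name used later in this session) and then passed to `min()` and `max()`:

```r
one_to_ten <- 1:10   # the numbers 1 to 10
one_to_ten           # [1]  1  2  3  4  5  6  7  8  9 10

min(one_to_ten)      # smallest value: [1] 1
max(one_to_ten)      # largest value: [1] 10
```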
R is extremely well documented. To find out what a
function does or how to use it, use the ? operator.
Arithmetic Mean
Description
Generic function for the (trimmed) arithmetic mean.
Usage
mean(x, ...)
## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
...
We can also ask R to give an example of the function in
use … though some are a bit complicated!
The full function signature for mean() is
mean(x, trim = 0, na.rm = FALSE, ...).
Breaking it down:
the function name is mean
inputs to the function - its arguments - are provided within the brackets
required arguments do not have a preset value
(e.g. x). They are generally positional so that,
provided you pass them in the correct order, you don’t have to specify
their names.
mean(one_to_ten) # is the same as ...
mean(x=one_to_ten) # ... this. The names of positional arguments are implicit
optional arguments have a preset value (e.g. trim = 0) and, if they’re not specified, the preset
will be used. For clarity, when we do specify optional arguments, it’s
good practice to use their names.
Ask R what these functions do:
mad
ceiling
floor
toupper
range
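Returning to the `mean()` signature broken down earlier, the difference between positional and optional arguments can be sketched as follows. The `values` vector is made up purely for illustration.

```r
values <- c(1, 2, 3, 100)          # hypothetical data with one extreme value

mean(values)                       # x passed by position: 26.5
mean(x = values)                   # x passed by name: 26.5

# Optional arguments keep their presets unless we name and change them
mean(values, trim = 0.25)          # drops the lowest and highest quarter: 2.5

mean(c(1, NA, 3))                  # na.rm = FALSE by default: NA
mean(c(1, NA, 3), na.rm = TRUE)    # missing values removed first: 2
```

Help pages for the functions listed above can be opened in the same way as for `mean()`, for example `?mad`.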
There are three basic data types that we use in
R day-to-day:
Numeric values store a number (with the subclasses
integer and double)
Character values (aka a string of
characters) wrapped in quotation marks store text
Logical values (aka a boolean) store a
TRUE/FALSE result
The type of an object can be determined using the
class() function.
## [1] "numeric"
## [1] "character"
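The calls that produced these outputs are not shown. As a sketch, with arbitrary example values, `class()` can be tried on each basic type:

```r
class(5)        # "numeric"
class("five")   # "character"
class(TRUE)     # "logical"

# The numeric subclasses mentioned above
class(5L)       # the L suffix makes an integer: "integer"
typeof(5)       # plain numbers are stored as doubles: "double"
```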
Basic data types can be grouped into collections of
elements. A vector is one of the most fundamental,
storing elements of the same type.
The c() function combines the given elements, flattening
any other vectors along the way.
## [1] 1 2 3
## [1] "one" "two" "three" "four" "five"
## [1] TRUE FALSE TRUE FALSE
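One way to obtain the three vectors printed above is sketched below. Exactly how the original code nested its vectors is an assumption.

```r
c(1, 2, 3)                                  # a numeric vector

first_three <- c("one", "two", "three")
c(first_three, c("four", "five"))           # nested vectors are flattened into one

c(TRUE, FALSE, TRUE, FALSE)                 # a logical vector
```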
There are a number of handy functions for creating vectors.
The seq() function (or the : operator) will
create a sequence of numbers
## [1] 11 12 13 14 15 16 17 18 19 20
## [1] 11 12 13 14 15 16 17 18 19 20
The rep() function will repeat a given object a
specified number of times
## [1] 11 12 13 14 15 11 12 13 14 15 11 12 13 14 15
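The outputs above match the following calls. This is a sketch, as the original arguments are not shown:

```r
seq(11, 20)            # sequence from 11 to 20
11:20                  # the : operator gives the same sequence

rep(11:15, times = 3)  # repeat 11 to 15 three times
```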
Let’s work out what these will do when making a vector:
seq(1, 10, 2)
10:1
c(1:10, 10:1)
Many of the core R functions are vectorised -
i.e. functions are written to operate on all elements of a vector
simultaneously.
Vector operations avoid the need to loop through and act on each element individually, making writing code more concise and less error prone. They’re also much faster!
## [1] 121 144 169 196 225 121 144 169 196 225 121 144 169 196 225
## [1] 2.397895 2.484907 2.564949 2.639057 2.708050 2.397895 2.484907 2.564949
## [9] 2.639057 2.708050 2.397895 2.484907 2.564949 2.639057 2.708050
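The two outputs above are what squaring and taking the natural log of the repeated vector would give. A sketch, assuming the vector was stored under the hypothetical name `x`:

```r
x <- rep(11:15, times = 3)

x^2      # every element is squared in one step, with no loop
log(x)   # the natural log of every element
```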
Specific elements can be extracted from vectors by their position (or
index) using square brackets ([ ]).
## [1] 15
## [1] 11 12 13
Conversely, specific elements can be excluded by using a minus sign.
## [1] 12 14 11 12 13 14 15 11 12 13 14 15
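A sketch of index-based extraction and exclusion that reproduces the outputs above, assuming the vector is the repeated sequence from earlier, stored here under the hypothetical name `x`:

```r
x <- rep(11:15, times = 3)

x[5]             # the fifth element: 15
x[1:3]           # the first three elements: 11 12 13
x[-c(1, 3, 5)]   # everything except elements 1, 3, and 5
```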
The order of index values matters: [c(1, 2)] is
different to [c(2, 1)]. The order of the supplied index
values can therefore be used to arrange data.
## [1] 15 14 13 12 11 15 14 13 12 11 15 14 13 12 11
order() returns the index values that, when used to index
a vector, arrange it in sorted order.
## [1] 1 6 11 2 7 12 3 8 13 4 9 14 5 10 15
## [1] 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15
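The rearrangements above can be sketched as follows, again assuming the repeated vector is stored under the hypothetical name `x`:

```r
x <- rep(11:15, times = 3)

x[15:1]          # indexing from last to first reverses the vector

idx <- order(x)  # positions that would put x in ascending order
idx              # 1 6 11 2 7 12 ...
x[idx]           # the sorted vector: 11 11 11 12 12 12 ...
```

Because `order()` returns positions rather than values, the same `idx` could be reused to arrange another vector of the same length in matching order.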
Name-based indexing of vectors is also possible.
Names can be assigned to an existing vector by using the
names() function.
## first second third
## 1 2 3
Alternatively, names can be supplied when the vector is first created.
## first second third
## 1 2 3
Named vectors can be indexed using square brackets ([ ])
(to retain the names) or with double square brackets
([[ ]]) (to discard them).
## second
## 2
## [1] 2
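A minimal sketch of naming and name-based indexing that gives the outputs above. The object name `v` is hypothetical.

```r
# Names added to an existing vector ...
v <- c(1, 2, 3)
names(v) <- c("first", "second", "third")
v

# ... or supplied when the vector is created
v <- c(first = 1, second = 2, third = 3)
v

v["second"]     # single brackets keep the name
v[["second"]]   # double brackets discard it: [1] 2
```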
Indexing does not work beyond the bounds of a vector, however: you can’t retrieve a value from a position greater than its length.
## [1] 3
## <NA>
## NA
Here, R uses the NA (not available) logical
constant to signify that a value is missing.
Similarly, you can’t index a named vector with a name that it does not contain.
## [1] "first" "second" "third"
## <NA>
## NA
Again, R uses the NA (not available)
logical constant to signify that a value is missing.
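Both out-of-bounds cases can be sketched with the same three-element named vector, stored here under the hypothetical name `v`:

```r
v <- c(first = 1, second = 2, third = 3)

length(v)     # [1] 3
v[4]          # there is no fourth element, so NA is returned

names(v)      # "first" "second" "third"
v["fourth"]   # there is no element with this name, so NA is returned
```

`is.na()` can be used to test for these missing values in later code.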
You have inherited 4 penguins from an estranged aunt!
Make a vector that records their ages in months: 15, 53, 97, and 64.
Have you chosen a good name?
Add the information that their names are Basil, Tim, Lisa, and Snapper.
What are their ages in complete years?
What is the median age of your penguins?
Specific elements within vectors can be replaced by referencing their index in an assignment operation. This works identically for position-based and name-based indexing.
## first second third
## 0 0 3
## first second third
## 1 2 3
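A sketch of replacement by position and then by name that yields the two outputs above. The exact values assigned in the original are assumed.

```r
v <- c(first = 1, second = 2, third = 3)

v[1:2] <- 0                            # replace by position
v                                      # first and second are now 0

v[c("first", "second")] <- c(1, 2)     # replace by name
v                                      # back to 1 2 3
```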
Two objects can be compared to one another using logical operations:
x == y … is equal to
x != y … is not equal to
x > y … is greater than (>= or
equal to)
x < y … is less than (<= or equal
to)
These return a logical TRUE/FALSE answer.
## [1] FALSE
Logical operators are vectorised in R; when run on a
vector, they return a logical vector of the same length …
## first second third
## TRUE TRUE FALSE
… which can be used to subset the original vector.
## first second
## 1 2
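The comparisons above can be sketched as follows. The single comparison `1 == 2` and the threshold `3` are assumptions chosen to match the printed results.

```r
1 == 2               # a single comparison: FALSE

v <- c(first = 1, second = 2, third = 3)

is_small <- v < 3    # one TRUE/FALSE per element
is_small             # TRUE TRUE FALSE

v[is_small]          # keep only the elements where the test is TRUE
```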
Asking whether a variable is contained within a vector is an example
of a set membership query and can be performed using the
%in% operator. As with other logical operators, these are
vectorised in R …
## [1] FALSE FALSE TRUE
… and can be used for subsetting.
## third
## 3
As stated, when run on a vector, these operations return a logical vector of the same length.
If it’s preferable to just have a list of the indices that satisfy a
given logical criterion, the which() function can be used
instead …
## [1] 3
… which can be used to subset the original vector in exactly the same way.
## third
## 3
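A sketch of set membership and `which()` that reproduces the outputs above. The query set `c(3, 4, 5)` is an assumption.

```r
v <- c(first = 1, second = 2, third = 3)

v %in% c(3, 4, 5)              # FALSE FALSE TRUE
v[v %in% c(3, 4, 5)]           # subset with the logical vector

hits <- which(v %in% c(3, 4, 5))
hits                           # the position of each TRUE: [1] 3
v[hits]                        # subset with the indices instead
```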
Single logical variables or vectors of them can be combined and
manipulated with the & (and), | (or), and
! (not) operators.
## [1] FALSE
## [1] TRUE TRUE FALSE
## [1] TRUE
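The original expressions are not shown. One set of inputs that gives the three results above:

```r
TRUE & FALSE                                   # both must be TRUE: FALSE

c(TRUE, FALSE, FALSE) | c(FALSE, TRUE, FALSE)  # either can be TRUE: TRUE TRUE FALSE

!FALSE                                         # negation: TRUE
```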
Many Emperor penguins don’t start breeding until they’re 6 years old.
To bump up your numbers, you decide to get more penguins.
Vectors can be created from simple text files by using the
scan(file, what) function. We pass two important
arguments:
file … the location of the file to be read
what … the basic data type that it contains:
character, numeric, or logical
## [1] "Wow" "that" "was" "very" "easy"
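The file read in the original is not named here, so this sketch writes a small temporary file first and then reads it back with `scan()`. The path is a stand-in for your own file.

```r
# Create a throwaway text file to read (stand-in for a real file)
path <- tempfile(fileext = ".txt")
writeLines("Wow that was very easy", path)

# Read whitespace-separated values as character data
words <- scan(file = path, what = "character")
words
```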
The data.frame() function can be used to combine
individual vectors into columns of a unified table.
Separate columns can be used to store different data types but each column is still an individual vector … and vectors can only store a single data type.
## col1 col2
## 1 1 one
## 2 2 two
## 3 3 three
## [1] "data.frame"
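A sketch that builds the table shown above. The object name `df` is hypothetical.

```r
df <- data.frame(
  col1 = 1:3,                       # a numeric column
  col2 = c("one", "two", "three")   # a character column
)

df
class(df)   # "data.frame"
```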
As data frames are two dimensional, different functions are required to manipulate them.
Columns can be added using cbind() (column bind) …
## col1 col2 col3
## 1 1 one TRUE
## 2 2 two FALSE
## 3 3 three TRUE
… or using the special $ operator.
## col1 col2 col3 col4
## 1 1 one TRUE 1.1
## 2 2 two FALSE 1.2
## 3 3 three TRUE 1.3
Similarly, instead of names(), there are
rownames() …
## col1 col2 col3 col4
## a 1 one TRUE 1.1
## b 2 two FALSE 1.2
## c 3 three TRUE 1.3
… and colnames().
## COL1 COL2 COL3 COL4
## a 1 one TRUE 1.1
## b 2 two FALSE 1.2
## c 3 three TRUE 1.3
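The four tables above can be rebuilt step by step as sketched here. The object name `df` and the use of `toupper()` to rename the columns are assumptions.

```r
df <- data.frame(col1 = 1:3, col2 = c("one", "two", "three"))

df <- cbind(df, col3 = c(TRUE, FALSE, TRUE))   # add a column with cbind()
df$col4 <- c(1.1, 1.2, 1.3)                    # add a column with $

rownames(df) <- c("a", "b", "c")               # rename the rows
colnames(df) <- toupper(colnames(df))          # rename the columns

df
```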
In a data frame, column and row names must be unique. Duplicate names are not allowed!
Instead of length(), to get the size of a data frame, we
can use nrow() …
## [1] 3
… ncol() …
## [1] 4
… or, to get both together, dim().
## [1] 3 4
To access the information within a data frame, as with vectors, we
use square brackets ([ ]). However, we must now define both
row and column positions:
[r, ] … index values before the comma refer to
rows
[, c] … index values after the comma refer to
columns
Omitting either one, as above, will return all possibilities.
## COL1 COL2 COL3 COL4
## b 2 two FALSE 1.2
## [1] "one" "two" "three"
As with named vectors, we can also index by row and column name …
## COL1 COL2 COL3 COL4
## a 1 one TRUE 1.1
## b 2 two FALSE 1.2
## [1] 1 2 3
… and we can access columns individually with the $
operator, which returns the underlying vector.
## [1] 1 2 3
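The size and indexing calls described above can be sketched on the same table, rebuilt here under the hypothetical name `df`:

```r
df <- data.frame(
  COL1 = 1:3,
  COL2 = c("one", "two", "three"),
  COL3 = c(TRUE, FALSE, TRUE),
  COL4 = c(1.1, 1.2, 1.3),
  row.names = c("a", "b", "c")
)

nrow(df)            # 3
ncol(df)            # 4
dim(df)             # 3 4

df[2, ]             # second row, all columns
df[, 2]             # all rows, second column (returned as a vector)

df[c("a", "b"), ]   # rows by name
df[, "COL1"]        # column by name
df$COL1             # the same column with $
```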
Tabular plain-text formats can be exported from most software. In
Excel it’s easy to export data as either tab-separated
(.tsv / .txt) or as comma-separated
(.csv) files directly when saving a document.
Tab-separated and comma-separated data can be imported using the
read.delim() and read.csv() functions,
respectively.
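The course data files are not named here, so this sketch writes two small temporary files and reads them back. The file contents are a cut-down, illustrative version of the penguin table.

```r
# A comma-separated file ...
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID,Name,Age", "E1,Basil,15", "E2,Tim,53"), csv_path)
from_csv <- read.csv(csv_path)

# ... and the same data, tab-separated
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("ID\tName\tAge", "E1\tBasil\t15", "E2\tTim\t53"), tsv_path)
from_tsv <- read.delim(tsv_path)

from_csv
from_tsv
```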
Data frames can be very large and cumbersome to view. To make things easier, we can use:
head() to display the first few rows
tail() to display the last few rows
View(), which in RStudio will open a
separate tab to display the contents
str() to view its structure and the basic types of
its component vectors
table() and summary() to view compact
representations of specified columns
## ID Name Sex Age HatColour
## 1 E1 Basil M 15 blue
## 2 E2 Tim M 53 red
## 3 E3 Lisa F 97 yellow
## 4 E4 Snapper M 64 green
## 5 E5 Erica F 99 yellow
## 6 E6 Jen F 107 red
## 'data.frame': 10 obs. of 5 variables:
## $ ID : chr "E1" "E2" "E3" "E4" ...
## $ Name : chr "Basil" "Tim" "Lisa" "Snapper" ...
## $ Sex : chr "M" "M" "F" "M" ...
## $ Age : int 15 53 97 64 99 107 60 150 131 78
## $ HatColour: chr "blue" "red" "yellow" "green" ...
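To try these viewing functions without the course file, the six rows printed above can be typed in directly. This is only a sketch: the real `penguins` table has 10 rows, and the remaining four are not reproduced here.

```r
penguins <- data.frame(
  ID        = c("E1", "E2", "E3", "E4", "E5", "E6"),
  Name      = c("Basil", "Tim", "Lisa", "Snapper", "Erica", "Jen"),
  Sex       = c("M", "M", "F", "M", "F", "F"),
  Age       = c(15, 53, 97, 64, 99, 107),
  HatColour = c("blue", "red", "yellow", "green", "yellow", "red")
)

head(penguins, 3)       # first three rows
tail(penguins, 2)       # last two rows
str(penguins)           # structure and column types
table(penguins$Sex)     # counts per category
summary(penguins$Age)   # numeric summary of one column
```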
Often, we need to merge data frames that have been stored separately.
The merge() function will automate merger by matching up
values from two tables based on a specified (key) column.
## [1] "ID" "Height" "Weight"
## ID Name Sex Age HatColour Height Weight
## 1 E1 Basil M 15 blue 120 35.6
## 2 E2 Tim M 53 red 115 33.2
## 3 E3 Lisa F 97 yellow 125 37.3
## 4 E4 Snapper M 64 green 127 38.0
## 5 E5 Erica F 99 yellow 110 30.1
## 6 E6 Jen F 107 red 119 36.5
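A sketch of the merge using only the six rows printed above. The name `measurements` for the second table is hypothetical, and the full course tables contain more rows than are shown here.

```r
penguins <- data.frame(
  ID        = c("E1", "E2", "E3", "E4", "E5", "E6"),
  Name      = c("Basil", "Tim", "Lisa", "Snapper", "Erica", "Jen"),
  Sex       = c("M", "M", "F", "M", "F", "F"),
  Age       = c(15, 53, 97, 64, 99, 107),
  HatColour = c("blue", "red", "yellow", "green", "yellow", "red")
)

measurements <- data.frame(
  ID     = c("E1", "E2", "E3", "E4", "E5", "E6"),
  Height = c(120, 115, 125, 127, 110, 119),
  Weight = c(35.6, 33.2, 37.3, 38.0, 30.1, 36.5)
)

colnames(measurements)   # "ID" "Height" "Weight"

# Match rows across the two tables using the shared ID column
merged <- merge(penguins, measurements, by = "ID")
merged
```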
So, you’ve started to record things about your penguins. We’re not judging.
How many penguins do you now have?
Are you sure? What happens if you merge the data frames with
merge(..., all=TRUE)?
Why is this different?
What are the average ages of the female and, separately, of the male penguins?
What function might we use to work out a unique list of the different hat colours they have?
How could we change the order of the rows if we wanted to arrange by age?
What does table(penguins$HatColour) do? How about
table(penguins$Sex, penguins$HatColour)?
Data frames and vectors containing NA values can require
special treatment when running functions or calculations.
## [1] NA NA
## [1] NA
We can sanitise an entire data frame by using the
na.omit() function to drop all NA-containing
rows …
… but, unless we want to discard data, it’s often better to just drop
NAs for specific calculations. Many functions take the
na.rm=TRUE argument to facilitate this
## [1] 110 130
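The calls behind these outputs are not shown. The behaviour can be sketched with made-up values, chosen only so that the range matches the output above:

```r
heights <- c(120, 115, NA, 127, 110, 130)   # hypothetical values with one NA

range(heights)                 # NA NA
mean(heights)                  # NA

range(heights, na.rm = TRUE)   # NAs dropped for this calculation: 110 130

# Dropping whole rows instead (hypothetical table)
measurements <- data.frame(ID = c("E1", "E2", "E3"), Height = c(120, NA, 125))
na.omit(measurements)          # the E2 row is removed
```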
Where vectors store categorical data, we can convert them to
a special factor class, which allows their use with
specialised functions.
A factor is suitable where we might logically group the rows by it:
Sex can be grouped … it’s a factor
Age is a continuous spread … it’s not a
factor
ID is a unique identifier for each row, there’s
probably no advantage to it being a factor
Further, there are two subclasses of factors:
nominal factors are inherently unordered
(e.g. Sex)
ordinal factors are inherently ordered (e.g. timepoint)
Factors can be created using the factor() function on a
vector …
## [1] M F M F M F F M M F
## Levels: F M
## [1] "factor"
… and its levels can be viewed.
## [1] "F" "M"
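A sketch that reproduces the factor printed above. The object name `sex` is an assumption.

```r
sex <- factor(c("M", "F", "M", "F", "M", "F", "F", "M", "M", "F"))

sex           # values, plus the levels: F M
class(sex)    # "factor"
levels(sex)   # "F" "M" (sorted alphabetically by default)
```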
The ordering of factor levels is important, as the first is taken as the reference. By default, levels are sorted numerically or lexicographically.
For nominal factors, this can be controlled by specifying either a different reference level …
## [1] M F M F M F F M M F
## Levels: M F
… or by re-applying the factor() function and providing
a different order.
## [1] M F M F M F F M M F
## Levels: M F
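Both routes to reordering the levels can be sketched as follows. `relevel()` is assumed to be the function used for changing the reference level, since the original call is not shown.

```r
sex <- factor(c("M", "F", "M", "F", "M", "F", "F", "M", "M", "F"))

# Option 1: choose a different reference (first) level
by_reference <- relevel(sex, ref = "M")
levels(by_reference)    # "M" "F"

# Option 2: re-apply factor() with the levels in the order we want
by_order <- factor(sex, levels = c("M", "F"))
levels(by_order)        # "M" "F"
```

In both cases only the level order changes, while the values themselves stay the same.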
For ordinal factors, this is controlled by re-applying the function
with factor(..., ordered=TRUE).
## [1] d0 d1 d3 d7 d10
## Levels: d0 d1 d10 d3 d7
## [1] d0 d1 d3 d7 d10
## Levels: d0 < d1 < d3 < d7 < d10
## [1] TRUE
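A sketch of creating the ordered factor shown above. The final comparison is a guess at what produced the `TRUE` output.

```r
timepoint <- factor(c("d0", "d1", "d3", "d7", "d10"))
levels(timepoint)    # sorted as text, so d10 comes before d3

# Re-apply factor() with the intended order and ordered = TRUE
timepoint <- factor(timepoint,
                    levels = c("d0", "d1", "d3", "d7", "d10"),
                    ordered = TRUE)
timepoint            # Levels: d0 < d1 < d3 < d7 < d10

# Ordered factors support comparisons between elements
timepoint[2] < timepoint[5]   # d1 comes before d10: TRUE
```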
The level of an element can be switched (provided the replacement level is already defined) in the same way a vector is modified.
## [1] d3 d1 d3 d7 d10
## Levels: d0 < d1 < d3 < d7 < d10
The levels() function can also be used to rename levels
en masse and adjust the elements correspondingly.
## [1] day3 day1 day3 day7 day10
## Levels: day0 < day1 < day3 < day7 < day10
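The two modifications above can be sketched as follows, starting from the ordered `timepoint` factor:

```r
timepoint <- factor(c("d0", "d1", "d3", "d7", "d10"),
                    levels = c("d0", "d1", "d3", "d7", "d10"),
                    ordered = TRUE)

timepoint[1] <- "d3"   # switch one element to an existing level
timepoint              # d3 d1 d3 d7 d10

# Rename every level; the elements follow automatically
levels(timepoint) <- c("day0", "day1", "day3", "day7", "day10")
timepoint              # day3 day1 day3 day7 day10
```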
However, to change an individual element to a previously undefined level, it must be added first.
levels(timepoint) = c("day0", "day1", "day3", "day7", "day10", "day14")
timepoint[1] = "day14"
timepoint
## [1] day14 day1 day3 day7 day10
## Levels: day0 < day1 < day3 < day7 < day10 < day14
If, as a result of modification or subsetting, levels become unused,
they can be removed using the droplevels() function.
## [1] day14 day1 day3 day7 day10
## Levels: day1 < day3 < day7 < day10 < day14
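A self-contained sketch of dropping the unused `day0` level, rebuilding the factor in the state shown above:

```r
timepoint <- factor(c("day14", "day1", "day3", "day7", "day10"),
                    levels = c("day0", "day1", "day3", "day7", "day10", "day14"),
                    ordered = TRUE)

table(timepoint)                     # day0 has a count of zero

timepoint <- droplevels(timepoint)   # remove levels that no element uses
levels(timepoint)                    # day0 is gone
```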
You have realised that HatColour contains categorical
data and should be turned into a factor.
Update the penguins data frame to make this
change.
Is HatColour nominal or ordinal?
Penguin E1 (Basil) does not like having a
blue hat and would prefer red. What steps do
we need to take to fully update our data frame?
We will cover plotting in more detail using the ggplot2
library in Session 4. ‘Base’ R can still
do some basic plotting, though, and has several types of graph built
in:
plot() produces scatter plots
boxplot()
barplot()
hist()
As with other functions, we can find out how to use these functions
with the ? operator.
boxplot expects a formula (of the form
y ~ x) as its single positional argument and we can pass a
data frame to it using the optional data argument.
plot, somewhat cryptically, says “any reasonable way of
defining the coordinates is acceptable” when describing its input
format. Suffice it to say that one reasonable way is to use a formula and
data, as we did for the boxplot.
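The plotting calls themselves are not shown above. A minimal sketch of the formula interface, using only the six penguins whose measurements were printed earlier (the course data has more rows) and assuming height is plotted against sex and weight:

```r
penguins <- data.frame(
  Sex    = factor(c("M", "M", "F", "M", "F", "F")),
  Height = c(120, 115, 125, 127, 110, 119),
  Weight = c(35.6, 33.2, 37.3, 38.0, 30.1, 36.5)
)

# y ~ x: Height on the y axis, grouped by Sex on the x axis
boxplot(Height ~ Sex, data = penguins)

# The same formula style gives a scatter plot of Height against Weight
plot(Height ~ Weight, data = penguins)
```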
Columns defined as factors can be used to modify the aesthetics of
the plot. For example, we can edit the colour and shape of points based
on Sex. It makes sense to then also use the
legend() function to show what we’ve done.
plot(Height~Weight, data=na.omit(penguins), col=c("purple", "green4")[Sex], pch=c("♀", "♂")[Sex])
legend("topleft", legend=levels(penguins$Sex), col=c("purple", "green4"), pch=c("♀", "♂"))
There are some penguin-related homework tasks to help cement what we’ve covered today!
The homework and instructions can be found within the main directory
for the course: ./homework/Homework_1.Rmd
Comments
Comments in your code allow you to keep track of your ideas and to document your work. In
R, on each line, everything after a # symbol is considered a comment and will not be run.
We can hijack this system to ‘comment out’ code sections if we don’t want them to run.
For the current or highlighted lines,
⌘+Shift+C (Mac) or CTRL+Shift+C (Windows) will comment and un-comment lines for you in RStudio.
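A short sketch of both uses of comments. The variable name is made up for illustration.

```r
# A whole-line comment: R ignores this line entirely
total <- 1 + 1   # a trailing comment: only the code before the # is run

# The next line has been 'commented out', so it will not run
# total <- total * 100

total            # still 2
```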